Final Project - ERASMUS+ Mobility Program

Abstract

Exploratory Data Analysis

The dataset we used covers the 2012-2013 academic year and can be found here. It is published directly by the European Union. It was compiled from the statistical reports of the national agencies of the 33 countries participating in the Erasmus+ program (Erasmus decentralised actions) and from data provided by the Education, Audiovisual and Culture Executive Agency (Erasmus centralised actions). The data is generated during each student's application process and then collected by the respective universities. It contains 267,547 observations and 34 variables.
The host institution country is one of the most interesting variables to us. It has a large number of undefined values, around 55 thousand, so we need to filter those out. For both host and home country, values are stored as country codes. However, Belgium is coded as three different values, “BEDE”, “BEFR” and “BENL”, depending on the language area (German, French or Dutch). We merge all of these values into a single one for the whole of Belgium.
There are 34 different variables and we do not use all of them, so we list the ones most relevant to our research:
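The cleaning step described above could be sketched roughly as follows. This is only an illustration: the field names home_country and host_country and the markers treated as undefined are assumptions, not the dataset's actual column names or encoding.

```python
# Hypothetical markers for an undefined country; the real dataset may use others.
UNDEFINED = {"", "???"}
# The three Belgian language-area codes mentioned in the text.
BELGIUM_CODES = {"BEDE", "BEFR", "BENL"}


def normalize_country(code):
    """Map the three Belgian codes to a single "BE" and leave other codes as-is."""
    code = code.strip().upper()
    return "BE" if code in BELGIUM_CODES else code


def clean(records):
    """Drop records with an undefined host country and merge the Belgian codes."""
    cleaned = []
    for rec in records:
        host = normalize_country(rec["host_country"])
        if host in UNDEFINED:
            continue  # undefined host country: filter the record out
        cleaned.append({
            **rec,
            "home_country": normalize_country(rec["home_country"]),
            "host_country": host,
        })
    return cleaned


# Hypothetical sample rows, not taken from the dataset.
sample = [
    {"home_country": "BENL", "host_country": "ES"},
    {"home_country": "FR", "host_country": "???"},
    {"home_country": "DE", "host_country": "BEFR"},
]
print(clean(sample))
```

Normalizing before filtering keeps the two rules independent, so either one can be changed without touching the other.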

The first thing we wanted to explore was whether there is a difference between the number of males and females enrolled in Erasmus. We expected to see a significant difference, as one of the cited papers suggests that there is a gender gap. We present a pie chart here to confirm this assumption.
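The shares behind such a pie chart can be computed with a simple count. The sketch below assumes a hypothetical gender field with values such as "F" and "M"; the sample rows are invented for illustration.

```python
from collections import Counter


def gender_shares(records):
    """Return the fraction of records per gender value."""
    counts = Counter(rec["gender"] for rec in records)
    total = sum(counts.values())
    return {gender: n / total for gender, n in counts.items()}


# Hypothetical sample rows, not taken from the dataset.
sample_students = [
    {"gender": "F"},
    {"gender": "F"},
    {"gender": "M"},
    {"gender": "F"},
]
print(gender_shares(sample_students))
```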

Next we wanted to see which countries send the most students on Erasmus. Rather than just listing them, we decided to present this metric on a map of Europe, coloring each country according to the number of students whose home university is in that country. We can see that Spain, France and Germany lead in the number of students enrolled in Erasmus. Surprisingly, Turkey also ranks very high.
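The per-country counts that drive the map coloring amount to a frequency table over the home country. A minimal sketch, assuming a hypothetical home_country field and invented sample rows:

```python
from collections import Counter


def students_per_home_country(records):
    """Count outgoing students per home country code."""
    return Counter(rec["home_country"] for rec in records)


# Hypothetical sample rows, not taken from the dataset.
sample_mobility = [
    {"home_country": "ES"},
    {"home_country": "FR"},
    {"home_country": "ES"},
    {"home_country": "DE"},
    {"home_country": "ES"},
    {"home_country": "FR"},
]

# Countries ordered from most to fewest outgoing students.
for country, n in students_per_home_country(sample_mobility).most_common():
    print(country, n)
```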

We were also interested in the areas of study in which Erasmus is most popular. The dataset contains a code for each area, and we used the International Standard Classification of Education to map those codes to area names. We also merged areas whose codes start with the same two digits, since those are related, and finally displayed the statistic as a bar plot.
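Merging areas by their first two digits can be sketched as below. The AREA_NAMES mapping is only an illustrative subset of two-digit fields and the sample codes are invented; the full mapping would come from the classification itself.

```python
from collections import Counter

# Illustrative subset of two-digit field names; extend from the official classification.
AREA_NAMES = {
    "34": "Business and administration",
    "52": "Engineering and engineering trades",
}


def students_per_area(subject_codes):
    """Group subject-area codes by their two-digit prefix and count students."""
    counts = Counter()
    for code in subject_codes:
        prefix = str(code)[:2]
        # Fall back to a generic label when the prefix is not in the mapping.
        counts[AREA_NAMES.get(prefix, f"Area {prefix}")] += 1
    return counts


# Hypothetical sample codes, not taken from the dataset.
sample_codes = ["340", "345", "521", "222"]
for area, n in students_per_area(sample_codes).most_common():
    print(area, n)
```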

To explore the data further, we looked at the age distribution. At that point we noticed a student who attended Erasmus at the age of 93. There were some other unusual records, such as students aged 73 and 69. Despite these outliers, we present the age distribution only up to age 30, which covers most of the students. The most frequent age was 22 among males and 21 among females.
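Restricting the distribution to age 30 and finding the most frequent age per gender could look roughly like this. The age and gender field names and the sample rows are assumptions made for the example.

```python
from collections import Counter, defaultdict


def modal_age_by_gender(records, max_age=30):
    """Return the most frequent age per gender, ignoring ages above max_age."""
    ages = defaultdict(Counter)
    for rec in records:
        if rec["age"] <= max_age:
            ages[rec["gender"]][rec["age"]] += 1
    return {gender: counter.most_common(1)[0][0] for gender, counter in ages.items()}


# Hypothetical sample rows, including one outlier above the cutoff.
sample_ages = [
    {"gender": "M", "age": 22},
    {"gender": "M", "age": 22},
    {"gender": "M", "age": 23},
    {"gender": "M", "age": 93},
    {"gender": "F", "age": 21},
    {"gender": "F", "age": 21},
    {"gender": "F", "age": 20},
]
print(modal_age_by_gender(sample_ages))
```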

Once we had some general knowledge of the data, we could move on to the first part of our question.

Institution Code    Students
S STOCKHO04         61
S LINKOPI01         58
E VALENCI02         56
S VAXJO03           54
PL WARSZAW02        53
DK RISSKOV06        42

Methods

In this section we discuss how we planned to answer our question. To do so, we first explored the strength of relationships within the dataset.